The semantic segmentation task requires building masks of objects at the original image resolution. But traditional convolutional networks construct feature maps of a smaller, coarser size.
How to deal with it?
Deconvolutional networks extend standard convolutional models with a series of unpooling and deconvolution (transposed convolution) layers that gradually restore the spatial dimensions to the original size.
While reconstructing the high-resolution output, such networks use almost no context information. Nevertheless, this architecture works well in practice and has shown good results.
So these deconvolutional networks basically learn how to interpolate.
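The interpolation view can be made concrete with a minimal 1-D NumPy sketch of a transposed ("deconvolution") layer. The triangular kernel below is an illustrative fixed choice; a deconvolutional network learns its kernel weights instead:

```python
import numpy as np

def transposed_conv1d(x, kernel, stride=2):
    """Upsample x by scattering each input value through the kernel.
    Output length = (len(x) - 1) * stride + len(kernel)."""
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, v in enumerate(x):
        out[i * stride : i * stride + len(kernel)] += v * kernel
    return out

# With this fixed triangular kernel the layer acts as a linear
# interpolator; a trained network would learn these weights.
x = np.array([1.0, 3.0, 5.0])
kernel = np.array([0.5, 1.0, 0.5])
y = transposed_conv1d(x, kernel, stride=2)
print(y)  # [0.5 1.  2.  3.  4.  5.  2.5] -- interior values interpolate x
```

Note how the interior outputs fill in the midpoints between the inputs: that is exactly the "learned interpolation" behavior described above.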
One way to preserve resolution without upsampling is to set all convolution strides to 1.
This lets us remove all the pooling layers from the network, which is the main idea of Fully Convolutional Networks (FCNs): they are built only of convolutions. It is now a mainstream architecture.
But if we want a large receptive field (i.e., large kernels), we will have to deal with a huge number of parameters. That won't work well.
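A back-of-the-envelope count makes the blow-up concrete (the channel width of 256 is an assumed, illustrative value):

```python
# Parameter count of one conv layer: k * k * C_in * C_out (bias ignored).
c_in = c_out = 256

# A single layer with a 15x15 receptive field needs a 15x15 kernel:
dense_15x15 = 15 * 15 * c_in * c_out
print(f"{dense_15x15:,}")   # 14,745,600 weights in one layer

# versus an ordinary 3x3 layer:
dense_3x3 = 3 * 3 * c_in * c_out
print(f"{dense_3x3:,}")     # 589,824 weights

# Stacking stride-1 3x3 layers grows the receptive field by only
# 2 pixels per layer, so reaching 15 takes (15 - 1) // 2 = 7 layers.
print((15 - 1) // 2)        # 7
```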
Do FCNs use dilated convolutions too?
Dilated (or atrous) convolutions are convolutions with a sparse kernel. Only a subset of the kernel positions carry learnable parameters; the rest are fixed at zero.
Dilation factor (rate) = the spacing between learnable kernel entries along a dimension: with rate r, one entry in every r pixels is learnable.
Purpose: dilated convolutions make it possible to increase the receptive field (the effective kernel size) while maintaining the same number of parameters.
Of course they may miss some high-frequency patterns, but in practice they tend to work pretty well.
So, the fully convolutional approach combined with atrous convolutions is a more effective way of maintaining the original spatial resolution:
Let's consider a traditional classification model:
State-of-the-art network architectures for segmentation utilize the following three ideas:
They append a fully convolutional part with dilated convolutions
A complete FCN would be too cumbersome, so some downsampling is kept in the first layers; hence an upsampling transformation is needed before the output
They replace the fully connected dense head with a series of 1x1 convolutions
They apply x8 bicubic interpolation to get back to the original resolution
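A rough shape-level sketch of such a head in NumPy, assuming a 224x224 input downsampled x8, 512 feature channels, and the 21 Pascal VOC classes (all assumed, illustrative values; a nearest-neighbor repeat stands in for the smoother bicubic interpolation):

```python
import numpy as np

H, W, C_in, n_classes = 28, 28, 512, 21   # 224x224 input downsampled x8

rng = np.random.default_rng(0)
features = rng.standard_normal((H, W, C_in))

# A 1x1 convolution is just a linear map applied at every pixel,
# so it can be written as a single matrix multiply over channels:
w = rng.standard_normal((C_in, n_classes))
scores = (features.reshape(-1, C_in) @ w).reshape(H, W, n_classes)

# x8 upsampling back to the input resolution (nearest-neighbor here
# as a stand-in for the bicubic interpolation used in practice):
full = scores.repeat(8, axis=0).repeat(8, axis=1)
print(full.shape)  # (224, 224, 21) -- per-class scores at full resolution
```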
The algorithm was proposed in 2017 and became state-of-the-art for semantic segmentation on the Pascal VOC dataset.
Main ideas:
A CRF (Conditional Random Field) is a probabilistic graphical model over multiple variables, where each variable depends on the others.
CRFs are a standard computer-vision trick for smoothing class boundaries, known since the early 2000s: we don't want too much jitter in pixel classes.
We need to model the probability of every possible segmentation (given our coarse scores) and choose the most probable one.
If we considered the mask pixels independent, the probability of a segmentation would simply be the product of per-pixel probabilities. But of course they are not independent: each mask pixel depends on all the others.
Calculating the probabilities of all possible segmentations is intractable, but we can factorize the distribution into a product of terms over small groups of pixels.
We assume that the segmentation probability factorizes into a product of unary terms (the network's per-pixel class scores) and pairwise terms that couple pixels with each other.
If there were no interactions, we would simply assign each pixel to the class with the highest probability. That would be our most probable mask.
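A toy NumPy illustration of this no-interaction baseline:

```python
import numpy as np

# Toy class-probability map: a 2x2 image with 3 classes.
probs = np.array([[[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]],
                  [[0.3, 0.3, 0.4],
                   [0.2, 0.5, 0.3]]])

# Under the independence assumption the most probable mask is just
# the per-pixel argmax -- there are no interaction terms to trade off.
mask = probs.argmax(axis=-1)
print(mask)  # [[0 1]
             #  [2 1]]
```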
How interactions are coded:
That way we find the segmentation with the fewest possible inconsistencies, where label changes between nearby pixels are penalized everywhere except across color borders.
Inference for CRFs is iterative: on each iteration, the class probabilities of every pixel are updated.
Those updates can be implemented as convolutions with a Gaussian filter.
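A heavily simplified mean-field sketch in NumPy: here only a spatial Gaussian kernel is used (a full dense CRF also has a bilateral, color-dependent kernel and a learned label-compatibility matrix), and the weight `w` is an illustrative assumption:

```python
import numpy as np

def gaussian_blur(p, sigma=1.0, radius=2):
    """Separable Gaussian filtering of per-class probability maps (H, W, C)."""
    t = np.arange(-radius, radius + 1)
    g = np.exp(-t**2 / (2 * sigma**2))
    g /= g.sum()
    # filter along both spatial axes for every class channel
    p = np.apply_along_axis(lambda r: np.convolve(r, g, mode="same"), 0, p)
    p = np.apply_along_axis(lambda r: np.convolve(r, g, mode="same"), 1, p)
    return p

def mean_field_step(q, unary, w=1.0):
    """One update: neighbors' current beliefs are aggregated by Gaussian
    filtering, added to the unary scores, and renormalized per pixel."""
    logits = unary + w * gaussian_blur(q)        # message passing (simplified)
    e = np.exp(logits - logits.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)          # per-pixel softmax

H, W, C = 8, 8, 3
rng = np.random.default_rng(0)
unary = rng.standard_normal((H, W, C))           # coarse network scores
q = np.full((H, W, C), 1.0 / C)                  # uniform initial beliefs
for _ in range(5):
    q = mean_field_step(q, unary)
print(q.shape)  # each pixel's class distribution still sums to 1
```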
A couple of optimizations: